home *** CD-ROM | disk | FTP | other *** search
- From: chughes@maths.tcd.ie (Conrad Hughes)
- Subject: Re: 32/64 bit square root.
- Date: 25 Sep 91 16:30:11 GMT
-
- Here's the best code I've managed yet; it manages to fit into four
- instructions per bit. If anyone has managed better, please let me
- know.. If you do use the code, it'd be nice to receive a mention
- :-)
- DEFFNsqrt(N,R,T,S):LOCALA%:IFS<>N [OPTpass:MOV S,N:]
- [OPTpass:MOV R,#0:]:FORA%=7TO0STEP-1:[OPTpass:ADD T,R,#1<<A%+15
- CMP S,T,LSL#A%+1:SUBGE S,S,T,LSL#A%+1:ADDGE R,R,#1<<A%+16:]NEXT:[OPTpass
- ADD T,R,#1<<14:CMP S,T:SUBGE S,S,T:ADDGE R,R,#1<<15:]FORA%=2TO15:[OPTpass
- ADD T,R,#1<<15-A%:RSBS S,T,S,LSL#1:ADDLT S,S,T:ADDGE R,R,#1<<16-A%:]NEXT
- [OPTpass:ADD T,R,R:ADD T,T,#1:RSBS S,T,S,LSL#2:ADDLT S,S,T:ADDGE R,R,#1
- CMP S,R:ADDGE R,R,#1:]=0
- ..defines a function which calculates the square root of a number
- stored in one register, with the fixed point between bits 15 and 16;
- I didn't write down behaviour for negative numbers, but you should
- be able to test this out for yourself, and it shouldn't be too
- difficult to extend to different formats. To convert a number into
- this format, you just multiply it by 65536, and to convert back, you
- divide by 65536 (i.e., 1.000 would be represented as 65536 in a
- register).
-
- So, for example, you want R1=SQRT(R0):
- FNsqrt(0,1,2,3) will preserve the contents of R0, and use R2 and R3
- for data - both will be corrupted.
- FNsqrt(0,1,2,0) will corrupt R0, but doesn't need R3.
- There are no other circumstances under which two of the registers
- the routine uses can safely be the same.
-
- The loops are unwrapped for speed (actually using loops would take
- up much less space, but would probably use up another register, and
- make the routine twice as slow). The routine takes up 400 or 404
- bytes, depending on whether N=S or not.
-
- One thing - this is slow :-}, so wherever possible you should avoid
- using square roots.. Similarly division.
-
- Conrad
-
-
-
- From: gcwillia@daisy.waterloo.edu (Graeme Williams)
- Subject: Re: The (non-charity) ray tracer
- Date: 20 Nov 91 19:21:07 GMT
-
- The algorithm I'd use would be a Newton-Raphson iteration scheme designed
- to solve x^2=z. (Whether this is the fastest method I don't know)
-
- The way this works is you make a guess for sqrt(z), say x_0. (The guess
- can be very rough, I'd do something like find out how many bits in z
- (say there were 23) half that (11 or 12) and then guess x_0 (2^11 or 2^12.)
-
- One then makes successive approximations via
-
- x_{n+1} = x_{n} - (x_{n}^2 - z)/2/x_{n}
-
- This scheme will converge very rapidly to sqrt(z) if your initial guess
- is at all close to z (give or take a factor of 2 or 4 is close enough).
- You might want to write (x_{n}^2-z) as (x_{n}+z)*(x_{n}-z) it might be
- quicker (I've no idea).
-
- As an example: if z=529020918
-
- and we guess x0=16384
- then we find x1=24336.43719
- x2=23037.12504
- x3=23000.48392
- x4=23000.45473
- x5=23000.45473
- x6=23000.45473
-
- so you can see 4 or maybe 5 iterations at most is enough. Each iteration
- (if you arrange things *carefully*) will need only a 16x16 bit multiply
- and a 32x16 bit divide.
-
- Graeme Williams
-
-
-
- From: torbenm@diku.dk (Torben AEgidius Mogensen)
- Date: 21 Nov 91 15:59:37 GMT
- Sender: torbenm@freke.diku.dk
-
- gcwillia@daisy.waterloo.edu (Graeme Williams) writes:
- >The algorithm I'd use would be a Newton-Raphson iteration scheme designed
- >to solve x^2=z. (Whether this is the fastest method I don't know)
- >...
- >so you can see 4 or maybe 5 iterations at most is enough. Each iteration
- >(if you arrange things *carefully*) will need only a 16x16 bit multiply
- >and a 32x16 bit divide.
-
- There is a *much* faster way. If your number is of the form a*2^n,
- where 0.5 <= a < 1.0 then do the following:
-
- sqrt(a*2^(2n)) = sqrt(a)*2^n
- sqrt(a*2^(2n+1)) = sqrt(a/2)*2^(n+1)
-
- This reduces the problem to finding a square root of a number a,
- 0.25 <= a < 1.0. This can be done by this method, which should have
- a cost comparable to 2 or 3 divisions.
-
- { 0.25 <= a < 1.0 }
- x := 4*(a-0.25);
- r := 1;
- n := 1;
- while (n<number-of-bits) and (a<>0) do begin
- { invariant: r^2+x = 4^n*a, x<2*r+1 }
- n := n+1;
- t := 4*r+1;
- x := 4*x;
- r := 2*r;
- if x >= t then begin
- x := x-t;
- r := r+1
- end
- end;
- r := r/2^n;
- { sqrt(a)-r < 2^-n }
-
- Note that only addition, subtractions and multiplications by 2 or 4
- are used in each part of the loop. At the end a single n-bit shift is
- done. Note that r and t are integers, while x is a real.
-
- The (a<>0) test allows you to exit the loop if an exact root is found,
- but it can be omitted without causing error (so the loop can be
- unwound).
-
- We can reformulate the algorithm to keep x < 1.0 throughout the
- algorithm:
-
- { 0.25 <= a < 1.0 }
- x := (a-0.25)/2;
- r := 1;
- for n := 2 to number-of-bits do begin
- { invariant: r^2+2^n*x= 4^n*a, x<(2*r+1)*2^-n }
- t := 4*r+1;
- x := 2*x;
- r := 2*r;
- if x >= t/2^(n+1) then begin
- x := x-t/2^(n+1);
- r := r+1
- end
- end;
- r := r/2^number-of-bits;
- { sqrt(a)-r < 2^-number-of-bits }
-
- Assuming we have a 32 bit mantissa which is stored with the most
- significant bit as 0.5, we would get the following ARM code, which
- uses 213 cycles to get a 32 bit result. A test of x=0 could be added
- at every iteration. This would make minimum time shorter, but assuming
- there is no large number of squares in the input, this would make the
- average time longer.
-
- \ on entry Ra holds a
- \ on exit Rr holds sqrt(a)
- \ uses Rx and Rt (either can be equal to Ra)
- SUB Rx,Ra,#2^30
- MOV Rx,Rx LSR #1 \ x := (a-0.25)/2;
- MOV Rr,#1 \ r := 1;
-
- MOV Rt,Rr ASL #2
- ADD Rt,Rt,#1 \ t := 4*r+1;
- MOV Rx,Rx ASL #1 \ x := 2*x;
- MOV Rr,Rr, ASL #1 \ r := 2*r;
- SUBS Rt,Rx,Rt ASL #29 \ if x >= t/2^(n+1) then begin
- MOVPL Rx,Rt \ x := x-t/2^(n+1);
- ADDPL Rr,Rr,#1 \ r := r+1
-
- ...
-
- MOV Rt,Rr ASL #2
- ADD Rt,Rt,#1 \ t := 4*r+1;
- MOV Rx,Rx ASL #1 \ x := 2*x;
- MOV Rr,Rr, ASL #1 \ r := 2*r;
- SUBS Rt,Rx,Rt ASL #0 \ if x >= t/2^(n+1) then begin
- MOVPL Rx,Rt \ x := x-t/2^(n+1);
- ADDPL Rr,Rr,#1 \ r := r+1
-
- \ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
-
- We can actually get this better by avoiding the doubling of r in
- each iteration. This way r and t are no longer integers, but the code
- is better anyway:
-
- { 0.25 <= a < 1.0 }
- x := a-0.25;
- r := 0.5;
- for n := 2 to number-of-bits do begin
- { invariant: r^2+x/2^(n-1) = a, x<r+2^-(n+1) }
- t := r+2^-(n+2);
- x := 2*x
- if x >= t then begin
- x := x-t;
- r := r+2^-(n+1)
- end
- end;
- { sqrt(a)-r < 2^-number-of-bits }
-
- In ARM this uses 149 cycles to get a 32 bit result. Try to better that!
-
- \ on entry Ra holds a
- \ on exit Rr holds sqrt(a)
- \ uses Rx and Rt (either can be equal to Ra)
- SUB Rx,Ra,#2^30 \ x := a-0.25;
- MOV Rr,#2^31 \ r := 0.5;
-
- ADD Rt,Rr,#2^28 \ t := r+2^-(n+2);
- MOV Rx,Rx ASL #1 \ x := 2*x;
- SUBS Rt,Rx,Rt \ if x >= t then begin
- MOVPL Rx,Rt \ x := x-t;
- ADDPL Rr,Rr,#2^29 \ r := r+2^-(n+1)
-
- ...
-
- ADD Rt,Rr,#2^0 \ t := r+2^-(n+2);
- MOV Rx,Rx ASL #1 \ x := 2*x;
- SUBS Rt,Rx,Rt \ if x >= t then begin
- MOVPL Rx,Rt \ x := x-t;
- ADDPL Rr,Rr,#2^1 \ r := r+2^-(n+1)
-
- \ t := r+2^-(n+2);
- \ { 2^-(n+2) is less than precision }
- \ x := 2*x;
- RSBS Rt,Rr,Rx ASL #1 \ if x >= t then begin
- \ x := x-t; { x not needed }
- ADDPL Rr,Rr,#2^0 \ r := r+2^-(n+1)
-
- \ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
-
-
- Note: I haven't tried this code on my Archimedes, as it is at home and
- I am at work. I have, however, tried the pseudo-Pascal code.
-
-
- From: torbenm@diku.dk (Torben AEgidius Mogensen)
- Date: 22 Nov 91 10:46:26 GMT
-
- I made a small error in this. The correct code is:
-
- { 0.25 <= a < 1.0 }
- x := a-0.25;
- r := 0.5;
- for n := 2 to number-of-bits do begin
- { invariant: r^2+x/2^(n-2) = a, x<r+2^-(n+1) }
- t := r+2^-(n+1);
- x := 2*x
- if x >= t then begin
- x := x-t;
- r := r+2^-n
- end
- end;
- { sqrt(a)-r < 2^-number-of-bits }
-
- There are, however, still a small problem: x might be greater than 1
- after the doubling. We can solve this by delaying the doubling:
-
- { 0.25 <= a < 1.0 }
- x := a-0.25;
- r := 0.5;
- for n := 2 to number-of-bits do begin
- { invariant: r^2+x/2^(n-2) = a, x<r+2^-(n+1) }
- t := r+2^-(n+1)
- if x >= t/2 then begin
- x := x-t/2;
- r := r+2^-n
- end;
- x := 2*x
- end;
- { sqrt(a)-r < 2^-number-of-bits }
-
- Also, my ARM code used unsigned numbers but compared them as if they
- were signed (using PL). Comparison of unsigned numbers requires HS.
- The code is still untested, so there might be further bugs. Note also
- that you can't write the constant 2^31 in BASIC assembler. Use 1<<31
- instead. The number of cycles have increased to 154, as an extra pass
- through the loop is required to get 32 bit precision:
-
- \ on entry Ra holds a
- \ on exit Rr holds sqrt(a)
- \ uses Rx and Rt (either can be equal to Ra)
- SUB Rx,Ra,#2^30 \ x := a-0.25;
- MOV Rr,#2^31 \ r := 0.5;
-
- ADD Rt,Rr,#2^29 \ t := r+2^-(n+1);
- SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin
- MOVHS Rx,Rt \ x := x-t/2;
- ADDHS Rr,Rr,#2^30 \ r := r+2^-n
- MOV Rx,Rx,ASL #1 \ x := 2*x;
-
- ...
-
- ADD Rt,Rr,#2^0 \ t := r+2^-(n+1);
- SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin
- MOVHS Rx,Rt \ x := x-t/2;
- ADDHS Rr,Rr,#2^1 \ r := r+2^-n
- MOV Rx,Rx,ASL #1 \ x := 2*x;
-
- \ t := r+2^-(n+1);
- \ { 2^-(n+1) is less than precision }
- SUBS Rt,Rx,Rr,LSR #1 \ if x >= t/2 then begin
- \ x := x-t/2; { x not needed }
- ADDHS Rr,Rr,#2^0 \ r := r+2^-n
- \ x := 2*x; { x not needed }
- \ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
-
-
- Torben Mogensen (torbenm@diku.dk)
-
-
-
- From: dseal@armltd.co.uk (David Seal)
- Subject: Square roots (was Re: The (non-charity) ray tracer)
- Date: 21 Nov 91 15:52:14 GMT
-
- The Newton-Raphson technique is a good one when you've got a fast division -
- e.g. if you've got division in hardware (and haven't got square root in
- hardware :-). However, if you're going to do division by a software routine
- (as you must on an ARM), there is a "long square root" algorithm which is
- very similar to the "long division" algorithms used to do division in
- software. It takes only a bit more time than one division *in total*,
- compared with a division, an addition and a shift *for each of several
- iterations* in Newton-Raphson. I would therefore recommend it for ARM square
- root routines when speed is important.
-
- I won't supply code, since (a) I don't know what type of numbers you want to
- take square roots of, how accurate a result you want, etc. - these details
- will obviously affect the code; (b) I don't have time to write code, whereas
- I have a written description of the algorithm on file which I can simply
- include here!
-
- I'll illustrate the "long square root" algorithm in base 10 - suppose we
- want to take the square roots of 2 and 20:
-
- (A) First write down the number you want to take the square root of, split
- into groups of two digits, with one split at the decimal point. In these
- particular cases, this gives us:
-
- ----------------- -----------------
- V 02. 00 00 00 ... V 20. 00 00 00 ...
-
- (Note that the two numbers chosen have the same pattern of digits, but
- different splits because of the different positions of the decimal
- point. This is the only reason they produce different answers.)
-
- (B) Initialise by finding the largest square which is smaller than the first
- pair of digits: in the first case the first pair of digits is 02 = 2, so
- this square is 1^2 = 1, and in the second the first pair of digits is
- 20, so this square is 4^2 = 16. Subtract the square from the first pair
- of digits, and put its square root in as the first digit of the result:
-
- 1. 4.
- ----------------- -----------------
- V 02. 00 00 00 ... V 20. 00 00 00 ...
- - 1 -16
- -- --
- 1 4
-
- Then bring the next pair of digits down:
-
- 1. 4.
- ----------------- -----------------
- V 02. 00 00 00 ... V 20. 00 00 00 ...
- - 1 -16
- ------ ------
- 100 400
-
- (C) To find the next digit of the result: take twice the answer so far. Look
- at numbers of the form N * (twice answer so far concatenated with N) for
- N a digit. The largest one of these that is still less than or equal to
- the remainder we've got gives the next digit of the result, and the
- value of N * (twice answer so far concatenated with N) is what we should
- subtract from the remainder before bringing down the next pair of
- digits:
-
- Twice answer so far is 2*1=2. Twice answer so far is 2*4=8.
- 0*20 = 0, less than 100 0*80 = 0, less than 400
- 1*21 = 21, less than 100 1*81 = 81, less than 400
- 2*22 = 44, less than 100 2*82 = 164, less than 400
- 3*23 = 69, less than 100 3*83 = 249, less than 400
- 4*24 = 96, less than 100 4*84 = 336, less than 400
- 5*25 = 125, more than 100 5*85 = 425, more than 400
- So the next digit of the answer So the next digit of the answer
- is 4, and we must subtract 96 is 4, and we must subtract 336
-
- 1. 4 4. 4
- ----------------- -----------------
- V 02. 00 00 00 ... V 20. 00 00 00 ...
- - 1 -16
- ------ ------
- 100 400
- - 96 -336
- ------ ------
- 400 6400
-
- (D) Repeat step (C) as many times as you need result digits - I'll do it
- once more for the examples:
-
- Twice answer so far is 2*14=28. Twice answer so far is 2*44=88.
- 0*280 = 0, less than 400 0*880 = 0, less than 6400
- 1*281 = 281, less than 400 1*881 = 881, less than 6400
- 2*282 = 564, more than 400 2*882 = 1764, less than 6400
- So the next digit of the answer :
- is 1, and we must subtract 281 :
- 7*887 = 6209, less than 6400
- 8*888 = 7104, more than 6400
- So the next digit of the answer
- is 7, and we must subtract 6209
-
- 1. 4 1 4. 4 7
- ----------------- -----------------
- V 02. 00 00 00 ... V 20. 00 00 00 ...
- - 1 -16
- ------ ------
- 100 400
- - 96 -336
- ------ ------
- 400 6400
- - 281 -6209
- ------- -------
- 11900 19100
-
- As to why the method works, I'll give some hints and leave you to work out
- the details yourself. To give the rough reason for it, note that the
- "remainder" in the above calculations is actually the difference between the
- number you are trying to take the square root of and the square of the
- answer so far:
-
- 2 20
- = 1^2 + 1 = 4^2 + 4
- = 1.4^2 + 0.04 = 4.4^2 + 0.64
- = 1.41^2 + 0.0119 = 4.47^2 + 0.0191
- = ... = ...
-
- [ Aside: this fact yields another useful property of the algorithm, which is
- that (as in long division) your answer is exactly right if the remainder
- becomes 0 and there are no further non-zeros to bring down. This can be
- very helpful e.g. for calculating square roots in floating point systems
- that conform to IEEE standard 754. ]
-
- Then prove this - it depends on the identity:
-
- (10x+y)^2 - (10x)^2 = 20xy + y^2
- = (20x + y) * y
-
- The quantities (20x + y) * y are what the digit calculation in part (C) is
- actually determining.
-
- One important observation: this algorithm simplifies immensely in binary!
- I'll run through it, illustrating it for the case of the square root of two
- again:
-
- (A) Write down the argument, split into pairs aligned with the binary point:
-
- --------------------
- V 10. 00 00 00 00 ...
-
- (B) If the first pair is 00, the first digit of the answer is 0; otherwise
- it is 1. Subtract the square of this first digit from the first pair and
- bring down the next pair:
-
- 1.
- --------------------
- V 10. 00 00 00 00 ...
- - 1
- ------
- 100
-
- (C) There are only two possible next digits, namely 0 and 1. But 0 * (twice
- the answer so far concatenated with 0) is always 0, which must be less
- than or equal to the remainder. So all that matters is whether 1 *
- (twice the answer so far concatenated with 1) is less than or equal to
- the remainder: if it is, the next result digit is 1; if it is greater
- than the remainder, the next result digit is 0. One more thing: twice
- the answer so far concatenated with 1 is the answer so far concatenated
- with 01.
-
- So the result digit selection procedure becomes: concatenate the answer
- so far with 01, and compare with the remainder. If greater, leave the
- remainder unchanged and the result digit is 0; if less than or equal,
- subtract it from the remainder and the result digit is 1.
-
- To continue the example:
-
- Result so far concatenated with 01 is 101, which is greater than 100,
- so result digit is 0 and subtract 0:
-
- 1. 0
- --------------------
- V 10. 00 00 00 00 ...
- - 1
- ------
- 100
- - 0
- ------
- 10000
-
- Result so far concatenated with 01 is 1001, which is less than 10000,
- so result digit is 1 and subtract 1001:
-
- 1. 0 1
- --------------------
- V 10. 00 00 00 00 ...
- - 1
- -------
- 100
- - 0
- -------
- 10000
- - 1001
- --------
- 11100
-
- Result so far concatenated with 01 is 10101, which is less than 11100,
- so result digit is 1 and subtract 10101:
-
- 1. 0 1 1
- --------------------
- V 10. 00 00 00 00 ...
- - 1
- -------
- 100
- - 0
- -------
- 10000
- - 1001
- --------
- 11100
- - 10101
- ---------
- 11100
-
- Result so far concatenated with 01 is 101101, which is greater than
- 11100, so result digit is 0 and subtract 0:
-
- 1. 0 1 1 0
- --------------------
- V 10. 00 00 00 00 ...
- - 1
- -------
- 100
- - 0
- -------
- 10000
- - 1001
- --------
- 11100
- - 10101
- ---------
- 11100
- - 0
- ----------
- 1110000
-
- One further point: Step (B) is just a special case of step (C) if we
- initialise the "square root so far" to 0 - i.e. you don't need special
- case hardware for the first iteration.
-
- Hope this helps.
-
- David Seal
- dseal@armltd.co.uk
-